ORLA 6541 - Exercise 1

The following code and write-up contain answers to in-text exercises in Wickham & Grolemund (2017), Chapters 3-6, and Bulut & Desjardins (2019), Chapter 4.

Wickham & Grolemund (2017):

Chapter 3: Data Visualization

3.2.4 Exercises

1. Run ggplot(data = mpg). What do you see?

A blank grey square

library(tidyverse)

ggplot(data = mpg)

2. How many rows are in mpg? How many columns?

234 rows and 11 columns

mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows

3. What does the drv variable describe? Read the help for ?mpg to find out.

The drv variable describes the type of drive train for each car in the data set, where f = front-wheel drive, r = rear-wheel drive, and 4 = four-wheel drive.

?mpg

4. Make a scatterplot of hwy vs cyl.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = cyl, y = hwy))

5. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?

Since both variables are categorical, the points align based on the intersection of both categories, which does not provide any useful information. Scatterplots are more useful for mapping the relationship between two continuous variables.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = class, y = drv))

3.3.1 Exercises

1. What's gone wrong with this code? Why are the points not blue?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

aes() is used to map an aesthetic to a variable. In this case, we wanted to set the color aesthetic to the literal color "blue". Since "blue" is not a variable, it cannot be mapped inside aes(); instead, the aesthetic has to be set manually, outside of aes(). As such:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

?mpg

Categorical: manufacturer, model, trans, drv, fl, class

Continuous: displ, year, cyl, cty, hwy

You can see this information when running mpg: the column type printed under each variable's name (<chr> for categorical; <int> or <dbl> for continuous) indicates which is which.

3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

Assigning color to a continuous variable provides a numerical scale of colors while assigning color to a categorical variable categorizes all levels of the variable by different colors.

Similarly, assigning size to a continuous variable provides a numerical scale of sizes, while assigning size to a categorical variable (although not advised), categorizes all levels of the variable by different sizes.

Continuous variables cannot be mapped to shape. Assigning shape to a categorical variable displays each level of the variable as a different shape, with a maximum of six discrete values, because more than six shapes become difficult to discriminate.

ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy, color = cty))

ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy, size = cty))
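As a sketch of the shape behavior described above: mapping shape to a categorical variable such as drv works, while mapping it to a continuous variable like cty raises an error.

```r
library(tidyverse)

# Shape mapped to a categorical variable: one symbol per level of drv
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = drv))

# Shape mapped to a continuous variable fails with
# "A continuous variable can not be mapped to shape":
# ggplot(data = mpg) +
#   geom_point(mapping = aes(x = displ, y = hwy, shape = cty))
```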

4. What happens if you map the same variable to multiple aesthetics?

You get a scatterplot in which the two aesthetics overlap; both convey the same information.

ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy, color = cty, alpha=cty))

5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

It is used to modify the width of the border of shapes that have a border (shapes 21-25).

?geom_point
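A minimal sketch of stroke in use, with shape 21 (a filled circle with a border), where fill and colour control the interior and border colors:

```r
library(tidyverse)

# stroke = 2 widens the black border around each white-filled point
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy),
             shape = 21, fill = "white", colour = "black", stroke = 2)
```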

6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.

It colors the points by the result of the condition, i.e., whether displ is less than 5 (TRUE) or not (FALSE).

ggplot(data = mpg) + 
    geom_point(mapping = aes(x = displ, y = hwy, color = displ<5))

3.5.1 Exercises

1. What happens if you facet on a continuous variable?

We get a separate facet for each unique value of the continuous variable, which quickly produces far too many panels to be useful.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ cty, nrow = 2)

2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

The empty cells in facet_grid indicate combinations of drv and cyl for which there are no observations. This relates to the first plot because the corresponding intersections in that plot are also empty; faceting simply spreads those same empty combinations across panels.

3. What plots does the following code make? What does . do?

. acts as a placeholder for "no variable": facet_grid(drv ~ .) facets by drv in rows only, and facet_grid(. ~ cyl) facets by cyl in columns only.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

4. Take the first faceted plot in this section: What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

Advantages: it is easier to identify patterns within each value of a variable, and we can use a larger number of facets while still being able to identify which facet represents which value, whereas with a large number of colors it may be difficult to discriminate between values.

Disadvantages: it may be more challenging to compare patterns between values of the variable and to identify overall patterns.

5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

?facet_wrap

nrow determines the number of rows in the facet layout and ncol determines the number of columns. Other options that control the layout of the individual panels include scales and dir. facet_grid() does not have nrow and ncol arguments because these are determined by the number of unique levels of the row and column variables.

6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

Because if you put the variable with more unique levels in the rows, it would create a plot that is taller than it is wide, which may be harder to follow and interpret on most computer screens.
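A minimal sketch of the difference, placing class (seven levels) in the columns versus the rows:

```r
library(tidyverse)

# Seven levels across the columns: wide but short, easy to scan on a screen
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ class)

# The same seven levels in the rows: very tall and harder to read
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(class ~ .)
```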

3.6.1 Exercises

1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

line chart - geom_line()
boxplot - geom_boxplot()
histogram - geom_histogram()
area chart - geom_area()
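One minimal example per geom, as a sketch using the mpg data:

```r
library(tidyverse)

ggplot(mpg, aes(x = displ, y = hwy)) + geom_line()        # line chart
ggplot(mpg, aes(x = drv, y = hwy)) + geom_boxplot()       # boxplot
ggplot(mpg, aes(x = hwy)) + geom_histogram(binwidth = 2)  # histogram
ggplot(mpg, aes(x = displ, y = hwy)) + geom_area()        # area chart
```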

2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

This will create a scatterplot with displ on the x-axis and hwy on the y-axis. Both the points and the smooth lines will be colored by the levels of drv, and the standard-error bands will not appear because se = FALSE.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

3. What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?

show.legend = FALSE hides the legend of the plot. If you remove it, the legend reappears. It is plausible that it was used earlier in the chapter to keep the plot sizes consistent across plots.

4. What does the se argument to geom_smooth() do?

It visualizes the confidence interval around the smoothed line as a shaded area; TRUE (the default) shows it, FALSE hides it.

5. Will these two graphs look different? Why/why not?

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

No, they will be identical, because both geom_point() and geom_smooth() use the same data and mapping. In the first plot these are inherited from ggplot(); in the second they are specified in each geom, but the result is the same.

6. Recreate the R code necessary to generate the following graphs.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(size = 5) +
  geom_smooth(se = FALSE, size = 2)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(group = drv), se = FALSE, size = 2) +
  geom_point(size = 5)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_smooth(se = FALSE, size = 2) +
  geom_point(size = 5)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(se = FALSE, size = 2) +
  geom_point(mapping = aes(color = drv), size = 5)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(linetype = drv), se = FALSE, size = 2) +
  geom_point(mapping = aes(color = drv), size = 5)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(size = 5, color = "white") +
  geom_point(aes(color = drv))

3.7.1 Exercises

1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

?stat_summary

The default geom associated with stat_summary() is geom_pointrange().

ggplot(data = diamonds) + 
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )

2. What does geom_col() do? How is it different to geom_bar()?

?geom_col
?geom_bar

There are two types of bar charts: geom_bar() and geom_col(). geom_bar() makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use geom_col() instead. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is.
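A sketch of the difference: geom_bar() needs only x and counts rows, while geom_col() needs both x and y because the bar heights come straight from values in the data (here, a mean computed first with dplyr).

```r
library(tidyverse)

# geom_bar(): height = number of rows in each class (stat_count)
ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))

# geom_col(): height = a value already in the data (stat_identity),
# here the mean highway mileage per class
mpg %>%
  group_by(class) %>%
  summarise(mean_hwy = mean(hwy)) %>%
  ggplot() +
  geom_col(mapping = aes(x = class, y = mean_hwy))
```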

3. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

Pairs of geoms and stats share names in common and typically have the same default stat.

4. What variables does stat_smooth() compute? What parameters control its behaviour?

?stat_smooth

Two parameters that control stat_smooth()'s behavior are method and formula. Method specifies the kind of smoothing function to apply, like a linear model (lm) or a generalized linear model (glm), and formula allows the user to provide the variables of interest for the given modeling strategy. In addition, users can also adjust the fullrange parameter, which tells R to either fit the smoothing for just the data or the full plot.

stat_smooth() generates the following variables:

- y or x: predicted value of the given variable(s)

- ymin or xmin: lower pointwise confidence interval around the mean

- ymax or xmax: upper pointwise confidence interval around the mean

- se: standard error

5.In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))

The problem is that all of the bars have the same height of 1, which is misleading: the proportions are computed within each group defined by x (and fill), so each group's proportion of itself is always 1. Setting group = 1 makes ggplot2 compute the proportions across the whole dataset.
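The corrected version of the first chart, setting group = 1 so the proportions are computed over the whole dataset:

```r
library(tidyverse)

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
```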

3.8.1 Exercises

1. What is the problem with this plot? How could you improve it?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()

It is possible that many points in the plot overlap and, therefore, hide each other. We can improve the plot by using the geom_jitter() function to add a small amount of random noise to each point, which will reveal any overlapping points. As such:

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_jitter()

2. What parameters to geom_jitter() control the amount of jittering?

Width, controlling the horizontal random noise, and height, controlling the vertical random noise.

?geom_jitter
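A sketch showing the two parameters in use (per the documentation, each defaults to 40% of the resolution of the data):

```r
library(tidyverse)

# Jitter only horizontally, by up to 0.5 units either side; height = 0
# leaves the y values untouched
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.5, height = 0)
```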

3. Compare and contrast geom_jitter() with geom_count().

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_count()

While geom_jitter() adds random noise to each point to reduce overplotting, geom_count() resizes each point according to the number of observations at that value. Although geom_count() can reduce overplotting, it may create overplotting of its own when the enlarged points overlap.

4. What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.

?geom_boxplot

The default position adjustment for geom_boxplot() is dodge2.

ggplot(data = mpg, aes(x = drv, y = hwy)) +
  geom_boxplot()
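The dodge2 adjustment is easier to see when several boxes share an x position; as a sketch, adding a colour aesthetic makes the boxplots within each class dodge side by side:

```r
library(tidyverse)

# position = "dodge2" is the default, so it does not need to be set
ggplot(data = mpg, aes(x = class, y = hwy, colour = drv)) +
  geom_boxplot()
```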

3.9.1 Exercises

1. Turn a stacked bar chart into a pie chart using coord_polar().

ggplot(mpg, aes(x = class, fill = drv)) + geom_bar()

ggplot(mpg, aes(x = class, fill = drv)) + geom_bar() + coord_polar()
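The code above yields a coxcomb chart. For a conventional pie chart, one sketch is to stack everything into a single bar and map the angle (theta) to the bar height:

```r
library(tidyverse)

# A single stacked bar, with theta = "y" so the stack becomes pie slices
ggplot(mpg, aes(x = factor(1), fill = drv)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")
```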

2. What does labs() do? Read the documentation.

?labs()

labs() allows users to set plot labels. Within the labs() function, users can provide a host of plot labels, such as title, subtitle, caption, and the axis and legend titles.
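A sketch of labs() in use, with hypothetical label text:

```r
library(tidyverse)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(title = "Highway mileage falls as engine size rises",  # hypothetical
       subtitle = "Data: mpg",
       x = "Engine displacement (litres)",
       y = "Highway miles per gallon",
       caption = "Wickham & Grolemund (2017)")
```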

3. What’s the difference between coord_quickmap() and coord_map()?

?coord_quickmap

coord_map() from ggplot2 allows users to map their data onto a 2D projection of the earth. Due to the curvature of the earth, the documentation notes that map projections do not, in general, preserve straight lines, and they require considerable computation. coord_quickmap() is a quicker alternative that approximates the requested projection while preserving straight lines; it works best for smaller areas.

4. What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

The plot below suggests a strong positive relationship between city miles per gallon and highway miles per gallon; as one rises, so does the other.

coord_fixed() is important because it ensures a fixed ratio between the physical representation of data points on the X and Y axes. This makes it easier to identify patterns in the plot.

geom_abline() adds a reference line to a plot, which is useful for annotating plots (geom_hline() and geom_vline() cover the horizontal and vertical cases). In this case it added the diagonal line y = x, which makes it easier to see that highway mpg is consistently higher than city mpg.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  coord_fixed()

Wickham & Grolemund Chapter 4

Workflow: basics

4.4 Exercises

1. Why does this code not work?

my_variable <- 10
my_varıable
#> Error in eval(expr, envir, enclos): object 'my_varıable' not found

This code does not work because the second line spells the name with the character "ı" (a dotless i) instead of "i", so my_varıable does not match the object my_variable defined above.

2. Tweak each of the following R commands so that they run correctly:

library(tidyverse)

ggplot(dota = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

fliter(mpg, cyl = 8)

filter(diamond, carat > 3)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

filter(mpg, cyl == 8)
## # A tibble: 70 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a6 quattro   4.2  2008     8 auto… 4        16    23 p     mids…
##  2 chevrolet    c1500 sub…   5.3  2008     8 auto… r        14    20 r     suv  
##  3 chevrolet    c1500 sub…   5.3  2008     8 auto… r        11    15 e     suv  
##  4 chevrolet    c1500 sub…   5.3  2008     8 auto… r        14    20 r     suv  
##  5 chevrolet    c1500 sub…   5.7  1999     8 auto… r        13    17 r     suv  
##  6 chevrolet    c1500 sub…   6    2008     8 auto… r        12    17 r     suv  
##  7 chevrolet    corvette     5.7  1999     8 manu… r        16    26 p     2sea…
##  8 chevrolet    corvette     5.7  1999     8 auto… r        15    23 p     2sea…
##  9 chevrolet    corvette     6.2  2008     8 manu… r        16    26 p     2sea…
## 10 chevrolet    corvette     6.2  2008     8 auto… r        15    25 p     2sea…
## # … with 60 more rows
filter(diamonds, carat > 3)
## # A tibble: 32 × 10
##    carat cut     color clarity depth table price     x     y     z
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  3.01 Premium I     I1       62.7    58  8040  9.1   8.97  5.67
##  2  3.11 Fair    J     I1       65.9    57  9823  9.15  9.02  5.98
##  3  3.01 Premium F     I1       62.2    56  9925  9.24  9.13  5.73
##  4  3.05 Premium E     I1       60.9    58 10453  9.26  9.25  5.66
##  5  3.02 Fair    I     I1       65.2    56 10577  9.11  9.02  5.91
##  6  3.01 Fair    H     I1       56.1    62 10761  9.54  9.38  5.31
##  7  3.65 Fair    H     I1       67.1    53 11668  9.53  9.48  6.38
##  8  3.24 Premium H     I1       62.1    58 12300  9.44  9.4   5.85
##  9  3.22 Ideal   I     I1       62.6    55 12545  9.49  9.42  5.92
## 10  3.5  Ideal   H     I1       62.8    57 12587  9.65  9.59  6.03
## # … with 22 more rows

3. Press Alt + Shift + K. What happens? How can you get to the same place using the menus?

A menu with all keyboard shortcuts appears. The same menu can be accessed through Tools > Keyboard Shortcuts Help.

library(tidyverse) 

Wickham & Grolemund Ch. 6

Workflow: scripts

6.3 Exercises

1. Go to the RStudio Tips twitter account, https://twitter.com/rstudiotips and find one tip that looks interesting. Practice using it!

Tip from: Dr. Rebecca Hirst @HirstR omg did you know you can access #R #cheatsheets straight from Rstudio?!

I’m heading back under my rock now 🙈 #coding #rstats

To follow Dr. Hirst’s tip, in RStudio, go to: File > Help > Cheat Sheets

2. What other common mistakes will RStudio diagnostics report? Read https://support.rstudio.com/hc/en-us/articles/205753617-Code-Diagnostics to find out.

- Warn if variable used has no definition in scope: warns if a symbol is used with no definition in the current, or parent, scope.
- Warn if variable is defined but not used: helps to identify a variable that is created but never used.
- Provide R style diagnostics (e.g. whitespace): checks whether your code conforms to Hadley Wickham's style guide, and reports style warnings when encountered.

Wickham & Grolemund Chapter 5

Load relevant packages

library(nycflights13)
library(tidyverse)

5.2.4

1.1 Find all flights that:

- Had an arrival delay of two or more hours
- Flew to Houston (IAH or HOU)
- Were operated by United, American, or Delta
- Departed in summer (July, August, and September)
- Arrived more than two hours late, but didn’t leave late
- Were delayed by at least an hour, but made up over 30 minutes in flight
- Departed between midnight and 6am (inclusive)
  • There were 10,200 flights that had an arrival delay of two or more hours
filter(flights, arr_delay >= 120)
## # A tibble: 10,200 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      811            630       101     1047            830
##  2  2013     1     1      848           1835       853     1001           1950
##  3  2013     1     1      957            733       144     1056            853
##  4  2013     1     1     1114            900       134     1447           1222
##  5  2013     1     1     1505           1310       115     1638           1431
##  6  2013     1     1     1525           1340       105     1831           1626
##  7  2013     1     1     1549           1445        64     1912           1656
##  8  2013     1     1     1558           1359       119     1718           1515
##  9  2013     1     1     1732           1630        62     2028           1825
## 10  2013     1     1     1803           1620       103     2008           1750
## # … with 10,190 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

1.2.

  • 9,313 flights flew to Houston (IAH or HOU)
filter(flights, dest == "IAH" | dest == "HOU")
## # A tibble: 9,313 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      623            627        -4      933            932
##  4  2013     1     1      728            732        -4     1041           1038
##  5  2013     1     1      739            739         0     1104           1038
##  6  2013     1     1      908            908         0     1228           1219
##  7  2013     1     1     1028           1026         2     1350           1339
##  8  2013     1     1     1044           1045        -1     1352           1351
##  9  2013     1     1     1114            900       134     1447           1222
## 10  2013     1     1     1205           1200         5     1503           1505
## # … with 9,303 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

1.3.

  • 139,504 flights were operated by United (UA), American (AA), or Delta (DL)
filter(flights, carrier %in% c("UA", "AA", "DL"))
## # A tibble: 139,504 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      554            600        -6      812            837
##  5  2013     1     1      554            558        -4      740            728
##  6  2013     1     1      558            600        -2      753            745
##  7  2013     1     1      558            600        -2      924            917
##  8  2013     1     1      558            600        -2      923            937
##  9  2013     1     1      559            600        -1      941            910
## 10  2013     1     1      559            600        -1      854            902
## # … with 139,494 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

1.4.

  • 86,326 flights departed in summer (July, August, and September)
filter(flights, month %in% c(7, 8, 9))
## # A tibble: 86,326 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     1        1           2029       212      236           2359
##  2  2013     7     1        2           2359         3      344            344
##  3  2013     7     1       29           2245       104      151              1
##  4  2013     7     1       43           2130       193      322             14
##  5  2013     7     1       44           2150       174      300            100
##  6  2013     7     1       46           2051       235      304           2358
##  7  2013     7     1       48           2001       287      308           2305
##  8  2013     7     1       58           2155       183      335             43
##  9  2013     7     1      100           2146       194      327             30
## 10  2013     7     1      100           2245       135      337            135
## # … with 86,316 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

1.5.

  • 29 flights arrived more than two hours late, but didn’t leave late
filter(flights, arr_delay > 120, dep_delay <= 0)
## # A tibble: 29 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1    27     1419           1420        -1     1754           1550
##  2  2013    10     7     1350           1350         0     1736           1526
##  3  2013    10     7     1357           1359        -2     1858           1654
##  4  2013    10    16      657            700        -3     1258           1056
##  5  2013    11     1      658            700        -2     1329           1015
##  6  2013     3    18     1844           1847        -3       39           2219
##  7  2013     4    17     1635           1640        -5     2049           1845
##  8  2013     4    18      558            600        -2     1149            850
##  9  2013     4    18      655            700        -5     1213            950
## 10  2013     5    22     1827           1830        -3     2217           2010
## # … with 19 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

1.6.

  • 206 flights were delayed by at least an hour, but made up over 30 minutes in flight
filter(flights, dep_delay >= 60 , arr_delay < 30)
## # A tibble: 206 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     3     1850           1745        65     2148           2120
##  2  2013     1     3     1950           1845        65     2228           2227
##  3  2013     1     3     2015           1915        60     2135           2111
##  4  2013     1     6     1019            900        79     1558           1530
##  5  2013     1     7     1543           1430        73     1758           1735
##  6  2013     1    11     1020            920        60     1311           1245
##  7  2013     1    12     1706           1600        66     1949           1927
##  8  2013     1    12     1953           1845        68     2154           2137
##  9  2013     1    19     1456           1355        61     1636           1615
## 10  2013     1    21     1531           1430        61     1843           1815
## # … with 196 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

1.7.

  • 9373 flights departed between midnight and 6am (inclusive)
summary(flights$dep_time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1     907    1401    1349    1744    2400    8255
# max departure time is 2400, this represents midnight

filter(flights, dep_time <= 600 | dep_time == 2400)
## # A tibble: 9,373 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 9,363 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

2.

  • The between(x, left, right) function allows you to filter values where x is greater than or equal to left and less than or equal to right.

  • The code for exercise 1.4 can be simplified using the between() function as follows:

# Simplifying code using between() function

filter(flights, between(month, 7, 9))
## # A tibble: 86,326 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     1        1           2029       212      236           2359
##  2  2013     7     1        2           2359         3      344            344
##  3  2013     7     1       29           2245       104      151              1
##  4  2013     7     1       43           2130       193      322             14
##  5  2013     7     1       44           2150       174      300            100
##  6  2013     7     1       46           2051       235      304           2358
##  7  2013     7     1       48           2001       287      308           2305
##  8  2013     7     1       58           2155       183      335             43
##  9  2013     7     1      100           2146       194      327             30
## 10  2013     7     1      100           2245       135      337            135
## # … with 86,316 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

3.

  • 8,255 flights have a missing departure time (dep_time).
  • For these flights the following details are also missing: departure delay (dep_delay), arrival time (arr_time), arrival delay (arr_delay), and amount of time spent in the air (air_time).
  • These are likely flights that never took off.
filter(flights, is.na(dep_time))
## # A tibble: 8,255 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1       NA           1630        NA       NA           1815
##  2  2013     1     1       NA           1935        NA       NA           2240
##  3  2013     1     1       NA           1500        NA       NA           1825
##  4  2013     1     1       NA            600        NA       NA            901
##  5  2013     1     2       NA           1540        NA       NA           1747
##  6  2013     1     2       NA           1620        NA       NA           1746
##  7  2013     1     2       NA           1355        NA       NA           1459
##  8  2013     1     2       NA           1420        NA       NA           1644
##  9  2013     1     2       NA           1321        NA       NA           1536
## 10  2013     1     2       NA           1545        NA       NA           1910
## # … with 8,245 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

4.

  • NA^0 is not missing since any value raised to the power of 0 is 1, regardless of what the unknown value is

  • NA | TRUE is TRUE, since the "or" is TRUE whether the unknown value turns out to be TRUE or FALSE

  • NA & FALSE is FALSE, since the "and" is FALSE whether the unknown value turns out to be TRUE or FALSE

  • NA * 0 remains NA, because the unknown value could be Inf or -Inf, for which multiplication by 0 gives NaN rather than 0

NA^0
## [1] 1
NA | TRUE
## [1] TRUE
NA & FALSE
## [1] FALSE
NA * 0
## [1] NA
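The last result follows the same "unknown value" logic, but in the opposite direction; a short base-R check of why multiplication by zero cannot be simplified:

```r
# NA could stand for Inf, and Inf * 0 is NaN rather than 0,
# so R cannot resolve NA * 0 the way it resolves NA^0
Inf * 0
## [1] NaN
-Inf * 0
## [1] NaN
```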

5.3.1

1.

  • Using arrange() to sort all missing values to the start (within the dep_delay column)
arrange(flights, desc(is.na(dep_delay)), dep_delay)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1       NA           1630        NA       NA           1815
##  2  2013     1     1       NA           1935        NA       NA           2240
##  3  2013     1     1       NA           1500        NA       NA           1825
##  4  2013     1     1       NA            600        NA       NA            901
##  5  2013     1     2       NA           1540        NA       NA           1747
##  6  2013     1     2       NA           1620        NA       NA           1746
##  7  2013     1     2       NA           1355        NA       NA           1459
##  8  2013     1     2       NA           1420        NA       NA           1644
##  9  2013     1     2       NA           1321        NA       NA           1536
## 10  2013     1     2       NA           1545        NA       NA           1910
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

2.

  • Sorting flights to find the most delayed flights.
  • The most delayed flight was HA 51 traveling from JFK to HNL on Jan 09, 2013
# arrange dep_delay column in descending order

arrange(flights, desc(dep_delay))
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
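The carrier and flight number behind the claim above can be read off by narrowing the sorted result to the identifying columns (a quick check; assumes the tidyverse and nycflights13 packages are loaded as elsewhere in this write-up):

```r
library(dplyr)
library(nycflights13)

# identify the single most delayed flight
flights %>%
  arrange(desc(dep_delay)) %>%
  select(carrier, flight, origin, dest, month, day, dep_delay) %>%
  head(1)
```

The first row corresponds to the HA 51 flight described above.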
  • Sorting flights to find the flights that left the earliest.
  • The flight that left the earliest was B6 97 traveling from JFK to DEN on Dec 07, 2013
  • The flight departed 43 minutes ahead of schedule
# arrange dep_delay column in ascending order

arrange(flights, dep_delay)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     7     2040           2123       -43       40           2352
##  2  2013     2     3     2022           2055       -33     2240           2338
##  3  2013    11    10     1408           1440       -32     1549           1559
##  4  2013     1    11     1900           1930       -30     2233           2243
##  5  2013     1    29     1703           1730       -27     1947           1957
##  6  2013     8     9      729            755       -26     1002            955
##  7  2013    10    23     1907           1932       -25     2143           2143
##  8  2013     3    30     2030           2055       -25     2213           2250
##  9  2013     3     2     1431           1455       -24     1601           1631
## 10  2013     5     5      934            958       -24     1225           1309
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

3.

  • Sorting flights to find the fastest (highest speed) flights.
  • speed = distance / time
  • fastest flights were determined by dividing distance/air_time and then arranging in descending order
  • Fastest flight was DL 1499 travelling from LGA to ATL on May 25, 2013
arrange(flights, desc(distance/air_time))
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     5    25     1709           1700         9     1923           1937
##  2  2013     7     2     1558           1513        45     1745           1719
##  3  2013     5    13     2040           2025        15     2225           2226
##  4  2013     3    23     1914           1910         4     2045           2043
##  5  2013     1    12     1559           1600        -1     1849           1917
##  6  2013    11    17      650            655        -5     1059           1150
##  7  2013     2    21     2355           2358        -3      412            438
##  8  2013    11    17      759            800        -1     1212           1255
##  9  2013    11    16     2003           1925        38       17             36
## 10  2013    11    16     2349           2359       -10      402            440
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
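Since distance is recorded in miles and air_time in minutes, multiplying the ratio by 60 expresses the speed in miles per hour; a sketch (the column name speed_mph is my own):

```r
library(dplyr)
library(nycflights13)

# distance (miles) / air_time (minutes) * 60 = average speed in mph
flights %>%
  mutate(speed_mph = distance / air_time * 60) %>%
  arrange(desc(speed_mph)) %>%
  select(carrier, flight, origin, dest, speed_mph)
```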

4.

  • Flight that traveled the farthest is HA 51 traveling from JFK to HNL, which traveled 4983 miles
arrange(flights, desc(distance))
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      857            900        -3     1516           1530
##  2  2013     1     2      909            900         9     1525           1530
##  3  2013     1     3      914            900        14     1504           1530
##  4  2013     1     4      900            900         0     1516           1530
##  5  2013     1     5      858            900        -2     1519           1530
##  6  2013     1     6     1019            900        79     1558           1530
##  7  2013     1     7     1042            900       102     1620           1530
##  8  2013     1     8      901            900         1     1504           1530
##  9  2013     1     9      641            900      1301     1242           1530
## 10  2013     1    10      859            900        -1     1449           1530
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
  • Flight that traveled the shortest distance is US 1632 traveling from EWR to LGA, which traveled 17 miles
arrange(flights, distance)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7    27       NA            106        NA       NA            245
##  2  2013     1     3     2127           2129        -2     2222           2224
##  3  2013     1     4     1240           1200        40     1333           1306
##  4  2013     1     4     1829           1615       134     1937           1721
##  5  2013     1     4     2128           2129        -1     2218           2224
##  6  2013     1     5     1155           1200        -5     1241           1306
##  7  2013     1     6     2125           2129        -4     2224           2224
##  8  2013     1     7     2124           2129        -5     2212           2224
##  9  2013     1     8     2127           2130        -3     2304           2225
## 10  2013     1     9     2126           2129        -3     2217           2224
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

5.4.1

1.

  • Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

    1. By specifying the four column names
select(flights, dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # … with 336,766 more rows
    2. By specifying the respective column numbers of the four variables
select(flights, 4, 6, 7, 9)
## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # … with 336,766 more rows
    3. By using the starts_with() helper
select(flights, starts_with("dep"), starts_with("arr"))
## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # … with 336,766 more rows
    4. By specifying variables within all_of()
select(flights, all_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))
## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # … with 336,766 more rows
    5. By specifying variables within any_of()
select(flights, any_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))
## # A tibble: 336,776 × 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # … with 336,766 more rows
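Two further ways, sketched under the assumption that the column order of flights matches the printout above (dep_time through arr_delay are contiguous apart from the sched_ columns):

```r
library(dplyr)
library(nycflights13)

# take the dep_time:arr_delay range, then drop the sched_ columns
select(flights, dep_time:arr_delay, -starts_with("sched"))

# or select columns whose names match a regular expression
select(flights, matches("^(dep|arr)_(time|delay)$"))
```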

2.

  • What happens if you include the name of a variable multiple times in a select() call?

  • The variable is only displayed once. The duplicate is essentially ignored by the select() function

select(flights, dep_time, dep_time, dep_time)
## # A tibble: 336,776 × 1
##    dep_time
##       <int>
##  1      517
##  2      533
##  3      542
##  4      544
##  5      554
##  6      554
##  7      555
##  8      557
##  9      557
## 10      558
## # … with 336,766 more rows

3.

  • What does the any_of() function do? Why might it be helpful in conjunction with this vector?

  • The any_of() function selects whichever of the variables named inside it are present in the data; unlike all_of(), it does not raise an error when some of those names are missing

  • Using it in conjunction with the vector that’s been defined as vars allows you to define the variables to select in one place, prior to using the any_of() function within select()

  • This helps simplify the code written within select()

  • Further, if you are looking to select the same variables from multiple tables, it allows for cleaner reuse of the code (without having to list the variables multiple times)
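A sketch using the vector from the book's exercise; the last call illustrates the difference from all_of():

```r
library(dplyr)
library(nycflights13)

vars <- c("year", "month", "day", "dep_delay", "arr_delay")

# select whichever of the names in vars exist in flights
select(flights, any_of(vars))

# any_of() tolerates names that are absent from the data,
# whereas all_of() would raise an error here
select(flights, any_of(c(vars, "not_a_column")))
```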

4.

  • Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

  • It is somewhat surprising that the select helpers ignore case by default

select(flights, contains("TIME"))
## # A tibble: 336,776 × 6
##    dep_time sched_dep_time arr_time sched_arr_time air_time time_hour          
##       <int>          <int>    <int>          <int>    <dbl> <dttm>             
##  1      517            515      830            819      227 2013-01-01 05:00:00
##  2      533            529      850            830      227 2013-01-01 05:00:00
##  3      542            540      923            850      160 2013-01-01 05:00:00
##  4      544            545     1004           1022      183 2013-01-01 05:00:00
##  5      554            600      812            837      116 2013-01-01 06:00:00
##  6      554            558      740            728      150 2013-01-01 05:00:00
##  7      555            600      913            854      158 2013-01-01 06:00:00
##  8      557            600      709            723       53 2013-01-01 06:00:00
##  9      557            600      838            846      140 2013-01-01 06:00:00
## 10      558            600      753            745      138 2013-01-01 06:00:00
## # … with 336,766 more rows
  • Setting the ignore.case argument to FALSE, however, changes this default so that case is respected; the uppercase “TIME” then matches no columns
select(flights, contains("TIME", ignore.case = FALSE))
## # A tibble: 336,776 × 0

5.5.2

1.

  • Converting dep_time and sched_dep_time to a more convenient representation: the number of minutes since midnight
# dep_time

transmute(flights,
          dep_time,
          dep_time_hour = dep_time %/% 100,
          dep_time_minute = dep_time %% 100,
          dep_time_mins = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440
)
## # A tibble: 336,776 × 4
##    dep_time dep_time_hour dep_time_minute dep_time_mins
##       <int>         <dbl>           <dbl>         <dbl>
##  1      517             5              17           317
##  2      533             5              33           333
##  3      542             5              42           342
##  4      544             5              44           344
##  5      554             5              54           354
##  6      554             5              54           354
##  7      555             5              55           355
##  8      557             5              57           357
##  9      557             5              57           357
## 10      558             5              58           358
## # … with 336,766 more rows
# sched_dep_time

transmute(flights,
          sched_dep_time,
          sched_dep_time_hour = sched_dep_time %/% 100,
          sched_dep_time_minute = sched_dep_time %% 100,
          sched_dep_time_mins = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440
)
## # A tibble: 336,776 × 4
##    sched_dep_time sched_dep_time_hour sched_dep_time_minute sched_dep_time_mins
##             <int>               <dbl>                 <dbl>               <dbl>
##  1            515                   5                    15                 315
##  2            529                   5                    29                 329
##  3            540                   5                    40                 340
##  4            545                   5                    45                 345
##  5            600                   6                     0                 360
##  6            558                   5                    58                 358
##  7            600                   6                     0                 360
##  8            600                   6                     0                 360
##  9            600                   6                     0                 360
## 10            600                   6                     0                 360
## # … with 336,766 more rows
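The hour-and-minute arithmetic can be wrapped into a small reusable helper (the function name time_to_mins is my own):

```r
# convert an HHMM clock integer to minutes since midnight;
# %% 1440 maps midnight stored as 2400 back to 0
time_to_mins <- function(t) {
  (t %/% 100 * 60 + t %% 100) %% 1440
}

time_to_mins(517)   # 5 hours 17 minutes past midnight
## [1] 317
time_to_mins(2400)  # midnight wraps around to 0
## [1] 0
```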

2.

  • Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

  • We would expect the computed value to match air_time; however, the values of arr_time - dep_time are generally larger than air_time

  • This is because arr_time and dep_time are clock readings stored in HHMM format, not quantities that can be subtracted directly

  • The first step is to convert arr_time and dep_time to minutes since midnight

  • However, completing this step and calculating the difference between arr_time and dep_time still does not produce values equivalent to those in the air_time column; for overnight flights that land after midnight, the difference is even negative

  • This is largely because arr_time is recorded in the destination’s local time zone (and the gate-to-gate difference also includes taxi time, which air_time excludes). All arrival and departure times would first need to be converted to a common time zone, with overnight arrivals handled, before the two columns could agree.

transmute(flights, 
       air_time,
       air_time_calculated = arr_time - dep_time
       )
## # A tibble: 336,776 × 2
##    air_time air_time_calculated
##       <dbl>               <int>
##  1      227                 313
##  2      227                 317
##  3      160                 381
##  4      183                 460
##  5      116                 258
##  6      150                 186
##  7      158                 358
##  8       53                 152
##  9      140                 281
## 10      138                 195
## # … with 336,766 more rows
# convert arrival time and departure time to minutes since midnight
# this conversion maps midnight (2400) to 1440
# %% 1440 wraps these values back to 0

transmute(flights, 
       air_time,
       arr_time_mins = (arr_time %/% 100 * 60 + arr_time %% 100) %% 1440,
       dep_time_mins = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
       arr_dep_diff = arr_time_mins - dep_time_mins
       )
## # A tibble: 336,776 × 4
##    air_time arr_time_mins dep_time_mins arr_dep_diff
##       <dbl>         <dbl>         <dbl>        <dbl>
##  1      227           510           317          193
##  2      227           530           333          197
##  3      160           563           342          221
##  4      183           604           344          260
##  5      116           492           354          138
##  6      150           460           354          106
##  7      158           553           355          198
##  8       53           429           357           72
##  9      140           518           357          161
## 10      138           473           358          115
## # … with 336,766 more rows

3.

  • Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

  • We would expect dep_delay = dep_time - sched_dep_time

  • However, running the calculation below, we observe that this is not the case.

  • As described above, this is because dep_time and sched_dep_time are stored in HHMM format rather than minutes; subtracting the raw values misbehaves whenever the two times fall in different hours (note the spurious differences of 40 below, i.e. 100 - 60). Both columns would first need to be converted to minutes since midnight for the expected relationship to hold

transmute(flights, 
        dep_time, 
        sched_dep_time, 
        dep_delay,
        dep_diff = dep_time - sched_dep_time,
        difference = dep_delay - dep_diff
        )
## # A tibble: 336,776 × 5
##    dep_time sched_dep_time dep_delay dep_diff difference
##       <int>          <int>     <dbl>    <int>      <dbl>
##  1      517            515         2        2          0
##  2      533            529         4        4          0
##  3      542            540         2        2          0
##  4      544            545        -1       -1          0
##  5      554            600        -6      -46         40
##  6      554            558        -4       -4          0
##  7      555            600        -5      -45         40
##  8      557            600        -3      -43         40
##  9      557            600        -3      -43         40
## 10      558            600        -2      -42         40
## # … with 336,766 more rows
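A sketch of the conversion described above. The shift by 720 before the modulus keeps early departures negative while still handling flights scheduled before midnight that actually left after it (column names are my own):

```r
library(dplyr)
library(nycflights13)

transmute(flights,
  dep_delay,
  dep_mins   = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
  sched_mins = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440,
  # map the raw difference into [-720, 719] so that, e.g., a flight
  # scheduled at 23:59 that left at 00:02 shows +3 rather than -1437
  delay_check = (dep_mins - sched_mins + 720) %% 1440 - 720
)
```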

4

  • Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

  • The min_rank() function assigns the same rank to tied values, which suits the kind of ranking needed for this exercise

  • Further, there are no tied values in the top 10 flights that were delayed

# create a data frame with delay rank included

 flights1 <- mutate(flights,
          delay_rank = min_rank(dep_delay)
          )

flights1
## # A tibble: 336,776 × 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 12 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   delay_rank <int>
# arrange dataframe in descending order of delay rank
# use head function to extract first 10 rows that represent the top 10 most delayed flights

head(arrange(flights1, desc(delay_rank)), 10)
## # A tibble: 10 × 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # … with 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, delay_rank <int>
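The ranking function can also be used directly inside filter(), avoiding the intermediate data frame; min_rank(desc(...)) ranks the largest delays first and keeps any ties (a sketch):

```r
library(dplyr)
library(nycflights13)

# keep only the rows whose departure delay ranks in the top 10
flights %>%
  filter(min_rank(desc(dep_delay)) <= 10) %>%
  arrange(desc(dep_delay))
```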

5

  • What does 1:3 + 1:10 return? Why?
  • This code returns the following result:
  • 2 4 6 5 7 9 8 10 12 11
  • Which is essentially calculated as follows:
  • 1 + 1, 2 + 2, 3 + 3, 1 + 4, 2 + 5, 3 + 6, 1 + 7, 2 + 8, 3 + 9, 1 + 10
  • since the first vector is shorter than the second, R recycles it: after using its last element, R starts again from its first element until the length of the longer vector is matched. Because 10 is not a multiple of 3, R also emits a warning that the longer object length is not a multiple of the shorter object length
1:3 + 1:10
##  [1]  2  4  6  5  7  9  8 10 12 11

6

  • What trigonometric functions does R provide?

  • All trigonometric functions provided by R can be found under the help documentation for Trig

  • R provides the following functions:

    • sine, cosine, and tangent, coded as sin(x), cos(x), and tan(x), respectively
    • cospi(x) = cos(pi * x), sinpi(x) = sin(pi * x), tanpi(x) = tan(pi * x)
    • arc-cosine, arc-sine, and arc-tangent, coded as acos(x), asin(x), and atan(x)
  • And finally, atan2(y, x), which provides the angle between the x-axis and the vector from (0, 0) to (x, y).

?Trig
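A quick illustration of what atan2(y, x) adds over the single-argument atan(): it keeps the quadrant of the vector (angles in radians):

```r
# atan(y / x) cannot tell (1, 1) apart from (-1, -1),
# but atan2(y, x) can
atan2(1, 1)    # pi / 4: first quadrant
## [1] 0.7853982
atan2(-1, -1)  # -3 * pi / 4: third quadrant
## [1] -2.356194
atan(-1 / -1)  # collapses back to pi / 4
## [1] 0.7853982
```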

5.6

1.

  • Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

  • A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.

  • A flight is always 10 minutes late.

  • A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.

  • 99% of the time a flight is on time. 1% of the time it’s 2 hours late.

  • Which is more important: arrival delay or departure delay?

  • Ultimately, what matters most is the organizational problem or decision that the data is meant to inform.

  • Assessing the delay characteristics in the options mentioned above allows for trying to find answers to the root cause of these delays in multiple ways

  • If the analysis is used to help understand how to improve airport processes to minimize delays within the departure airport, then departure delay may be more important.

  • Alternatively if the data is used to help understand airport processes within the arrival airport could be improved, arrival delay would be more important.

  • Further, if the data is used to understand overall costs associated with delays, then both variables may be equally important to understand in depth.

  • Thus, the research question at hand / the organizational problem in question will help determine how best to group data to inform decisions and which variables to focus on
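One concrete way to carry out the brainstorm above is to compute several delay characteristics per flight number, which distinguishes the scenarios listed (e.g., a flight that is always 10 minutes late has sd_delay near 0, while the 50/50 scenarios have a mean near 0 but a large spread). A sketch; the summary column names are my own:

```r
library(dplyr)
library(nycflights13)

flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(carrier, flight) %>%
  summarise(
    mean_delay   = mean(arr_delay),           # typical delay
    median_delay = median(arr_delay),         # robust to rare extremes
    sd_delay     = sd(arr_delay),             # spread around the typical value
    p90_delay    = quantile(arr_delay, 0.9),  # how bad the worst trips get
    prop_late    = mean(arr_delay > 0),       # share of late arrivals
    .groups = "drop"
  )
```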

2.

  • Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).
not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>%
  count(dest)
## # A tibble: 104 × 2
##    dest      n
##    <chr> <int>
##  1 ABQ     254
##  2 ACK     264
##  3 ALB     418
##  4 ANC       8
##  5 ATL   16837
##  6 AUS    2411
##  7 AVL     261
##  8 BDL     412
##  9 BGR     358
## 10 BHM     269
## # … with 94 more rows
# a total count of flights that were not cancelled by destination is provided

# alternative method to generate this using group_by

not_cancelled %>%
  group_by(dest) %>%
  summarise(
    total_not_cancelled = n()
  )
## # A tibble: 104 × 2
##    dest  total_not_cancelled
##    <chr>               <int>
##  1 ABQ                   254
##  2 ACK                   264
##  3 ALB                   418
##  4 ANC                     8
##  5 ATL                 16837
##  6 AUS                  2411
##  7 AVL                   261
##  8 BDL                   412
##  9 BGR                   358
## 10 BHM                   269
## # … with 94 more rows
not_cancelled %>% count(tailnum, wt = distance)
## # A tibble: 4,037 × 2
##    tailnum      n
##    <chr>    <dbl>
##  1 D942DN    3418
##  2 N0EGMQ  239143
##  3 N10156  109664
##  4 N102UW   25722
##  5 N103US   24619
##  6 N104UW   24616
##  7 N10575  139903
##  8 N105UW   23618
##  9 N107US   21677
## 10 N108UW   32070
## # … with 4,027 more rows
# alternative without using count

not_cancelled %>%
  group_by(tailnum) %>%
  summarise(
    total = sum(distance)
  )
## # A tibble: 4,037 × 2
##    tailnum  total
##    <chr>    <dbl>
##  1 D942DN    3418
##  2 N0EGMQ  239143
##  3 N10156  109664
##  4 N102UW   25722
##  5 N103US   24619
##  6 N104UW   24616
##  7 N10575  139903
##  8 N105UW   23618
##  9 N107US   21677
## 10 N108UW   32070
## # … with 4,027 more rows

3

  • Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay) ) is slightly suboptimal. Why? Which is the most important column?

  • Among the flights selected by this definition, arr_delay is always NA, but dep_delay is not always NA

  • That is, there are rows in which a value is logged for dep_delay yet is missing for arr_delay

  • These may be flights that were delayed in taking off and then had to reroute or divert, never reaching their scheduled destination

  • Thus the most important column is arr_delay

flights %>% 
  filter(is.na(dep_delay) | is.na(arr_delay)) %>%
  select(dep_delay, arr_delay) %>%
  filter(!is.na(arr_delay))
## # A tibble: 0 × 2
## # … with 2 variables: dep_delay <dbl>, arr_delay <dbl>
# all arr_delay rows = NA however all dep_delay rows are not NA

5

Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))

F9, EV, and YV had the highest average delays. US and HA had the lowest average delays in 2013.

flights %>%
  filter(!is.na(dep_delay)) %>% 
  group_by(carrier) %>%
  summarise(carrier_delay = mean(dep_delay)) %>% 
  arrange(desc(carrier_delay)) 
## # A tibble: 16 × 2
##    carrier carrier_delay
##    <chr>           <dbl>
##  1 F9              20.2 
##  2 EV              20.0 
##  3 YV              19.0 
##  4 FL              18.7 
##  5 WN              17.7 
##  6 9E              16.7 
##  7 B6              13.0 
##  8 VX              12.9 
##  9 OO              12.6 
## 10 UA              12.1 
## 11 MQ              10.6 
## 12 DL               9.26
## 13 AA               8.59
## 14 AS               5.80
## 15 HA               4.90
## 16 US               3.78
flights %>%
  group_by(carrier) %>% 
  filter(!is.na(dep_delay)) %>% 
  summarize(avg_delay = mean(dep_delay)) 
## # A tibble: 16 × 2
##    carrier avg_delay
##    <chr>       <dbl>
##  1 9E          16.7 
##  2 AA           8.59
##  3 AS           5.80
##  4 B6          13.0 
##  5 DL           9.26
##  6 EV          20.0 
##  7 F9          20.2 
##  8 FL          18.7 
##  9 HA           4.90
## 10 MQ          10.6 
## 11 OO          12.6 
## 12 UA          12.1 
## 13 US           3.78
## 14 VX          12.9 
## 15 WN          17.7 
## 16 YV          19.0

6

What does the sort argument to count() do? When might you use it?

The sort argument of count() sorts the output so that the largest groups appear at the top of the returned data frame. This is useful when you want counts printed in descending order, similar to arrange(desc(n)). The code below uses sort = TRUE in count() to show which airlines logged the most flights in the data set.

flights %>%
  count(carrier, sort = TRUE) 
## # A tibble: 16 × 2
##    carrier     n
##    <chr>   <int>
##  1 UA      58665
##  2 B6      54635
##  3 EV      54173
##  4 DL      48110
##  5 AA      32729
##  6 MQ      26397
##  7 US      20536
##  8 9E      18460
##  9 WN      12275
## 10 VX       5162
## 11 FL       3260
## 12 AS        714
## 13 F9        685
## 14 YV        601
## 15 HA        342
## 16 OO         32
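
As a cross-check, count(carrier, sort = TRUE) is shorthand for grouping, tallying, and arranging in descending order. A sketch of the equivalent longhand pipeline on the same flights data:

```r
library(dplyr)
library(nycflights13)

# longhand equivalent of flights %>% count(carrier, sort = TRUE)
flights %>%
  group_by(carrier) %>%
  summarise(n = n()) %>%
  arrange(desc(n))
```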

5.7.1

1.

  • Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.

Using group_by() prior to filtering or mutating means that filters or calculations are applied within each level of the grouping variable. As a result, in the example below, when we grouped by month and then calculated the average delay, the mean was computed separately for each month. mutate() returns this value as a new column on the data set, so the within-month calculation shows up as the same number repeated on every row of a month. The repetition is expected, since we took the mean within the month.

flights %>%
  group_by(month) %>% 
  filter(!is.na(dep_delay)) %>% 
  mutate(dep_delay_mean = mean(dep_delay)) %>% 
  select(month, dep_delay_mean) %>% head(10) 
## # A tibble: 10 × 2
## # Groups:   month [1]
##    month dep_delay_mean
##    <int>          <dbl>
##  1     1           10.0
##  2     1           10.0
##  3     1           10.0
##  4     1           10.0
##  5     1           10.0
##  6     1           10.0
##  7     1           10.0
##  8     1           10.0
##  9     1           10.0
## 10     1           10.0
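
The grouped filter() case is not shown above, so as one additional sketch: a grouped filter keeps rows relative to each group's own statistics rather than the whole table's. For example, keeping only each month's single worst departure delay:

```r
library(dplyr)
library(nycflights13)

# with group_by(month), max() is computed within each month, not over all flights
flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(month) %>%
  filter(dep_delay == max(dep_delay)) %>%
  select(month, day, carrier, dep_delay)
```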

2.

  • Which plane (tailnum) has the worst on-time record?
  • Defining the worst on-time record as the plane with the highest mean arrival delay
  • N844MH is the plane with the worst on-time record
flights %>%
  filter(!is.na(tailnum), !is.na(arr_time)) %>%
  group_by(tailnum) %>%
  summarise(mean_delay = mean(arr_delay, n=n())) %>%
  arrange (desc(mean_delay))
## # A tibble: 4,037 × 2
##    tailnum mean_delay
##    <chr>        <dbl>
##  1 N844MH        320 
##  2 N911DA        294 
##  3 N922EV        276 
##  4 N587NW        264 
##  5 N851NW        219 
##  6 N928DN        201 
##  7 N7715E        188 
##  8 N654UA        185 
##  9 N665MQ        175.
## 10 N427SW        157 
## # … with 4,027 more rows
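
One caution about the chunk above: in mean(arr_delay, n = n()), the n = n() argument is silently absorbed by mean()'s ... and ignored, so it has no effect on the result. A sketch that keeps the flight count as its own column, which also allows screening out planes with very few flights (the threshold of 20 flights here is an arbitrary choice for illustration):

```r
library(dplyr)
library(nycflights13)

flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(tailnum) %>%
  summarise(mean_delay = mean(arr_delay), n_flights = n()) %>%
  filter(n_flights >= 20) %>%  # arbitrary cutoff to exclude rarely-flown planes
  arrange(desc(mean_delay))
```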

3.

  • What time of day should you fly if you want to avoid delays as much as possible?
  • Flights around 7 am have the lowest average arrival delay, so early morning is the best time to fly to avoid delays
flights %>%
  group_by(hour) %>%
  summarise(mean_delay = mean(arr_delay, n=n())) %>%
  arrange((mean_delay))
## # A tibble: 20 × 2
##     hour mean_delay
##    <dbl>      <dbl>
##  1     7     -5.30 
##  2     5     -4.80 
##  3     6     -3.38 
##  4     9     -1.45 
##  5     8     -1.11 
##  6    10      0.954
##  7    11      1.48 
##  8    12      3.49 
##  9    13      6.54 
## 10    14      9.20 
## 11    23     11.8  
## 12    15     12.3  
## 13    16     12.6  
## 14    18     14.8  
## 15    22     16.0  
## 16    17     16.0  
## 17    19     16.7  
## 18    20     16.7  
## 19    21     18.4  
## 20     1    NaN

4.

  • For each destination, compute the total minutes of delay.
flights %>%
  filter(arr_delay >0) %>%
  group_by(dest) %>%
  summarise(total_delay = sum(arr_delay, n=n())) %>%
  arrange((total_delay))
## # A tibble: 103 × 2
##    dest  total_delay
##    <chr>       <dbl>
##  1 PSP            42
##  2 ANC            67
##  3 HDN           127
##  4 SBN           129
##  5 MTJ           175
##  6 EYW           204
##  7 BZN           507
##  8 JAC           636
##  9 CHO           966
## 10 MYR          1037
## # … with 93 more rows
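
A caution about the chunk above: sum() folds every argument into the total, so sum(arr_delay, n = n()) inflates each destination's total by its row count (for example, sum(c(10, 20), 5) is 35, not 30). A sketch that keeps the count as a separate column:

```r
library(dplyr)
library(nycflights13)

flights %>%
  filter(arr_delay > 0) %>%
  group_by(dest) %>%
  summarise(total_delay = sum(arr_delay), n_flights = n()) %>%
  arrange(total_delay)
```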
  • For each flight, compute the proportion of the total delay for its destination.
flights %>%
  filter(arr_delay >0) %>%
  group_by(dest) %>%
  mutate(
    total_delay = sum(arr_delay),
    prop_delay = arr_delay / total_delay)%>%
  arrange((total_delay))
## # A tibble: 133,004 × 21
## # Groups:   dest [103]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1    26     1051           1055        -4     1401           1400
##  2  2013    12    21     1040           1030        10     1336           1335
##  3  2013     2     2     1051           1055        -4     1404           1400
##  4  2013     2    23     1053           1055        -2     1408           1400
##  5  2013     3    16     1046           1055        -9     1412           1355
##  6  2013     3    23     1050           1055        -5     1400           1355
##  7  2013     7     6     1629           1615        14     1954           1953
##  8  2013     7    13     1618           1615         3     1955           1953
##  9  2013     7    20     1618           1615         3     2003           1953
## 10  2013     8     3     1615           1615         0     2003           1953
## # … with 132,994 more rows, and 13 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   total_delay <dbl>, prop_delay <dbl>

5.

  • Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave.
  • Using lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.
delay_lags <- flights %>%
  group_by(origin) %>%
  mutate(dep_delay_lag = lag(dep_delay)) %>%
  filter(!is.na(dep_delay), !is.na(dep_delay_lag))


ggplot(data = delay_lags, mapping = aes(x = dep_delay_lag, y = dep_delay)) +
          geom_point () +
  scale_x_continuous(breaks = seq(0, 1500, by = 100)) +
  labs(y = "Flight Departure Delay", x = "Previous Flight Departure Delay")
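
To quantify the pattern in the plot, the lagged relationship can also be summarised as a correlation per origin airport (a sketch; a positive value would indicate that delays carry over to the following flight):

```r
library(dplyr)
library(nycflights13)

# correlation between a flight's delay and the preceding flight's delay
flights %>%
  group_by(origin) %>%
  mutate(dep_delay_lag = lag(dep_delay)) %>%
  filter(!is.na(dep_delay), !is.na(dep_delay_lag)) %>%
  summarise(delay_cor = cor(dep_delay, dep_delay_lag))
```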

6. Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error).

Flight DL 1499 had a speed of 703 miles per hour. That is substantially faster than most of the other recorded speeds. It is possible that distance for this flight was entered as kilometers and not miles.

  • Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
# fast flights
flights %>%
  mutate(hour = air_time/60, speed = distance/hour) %>% 
  arrange(desc(speed)) %>% select(flight, speed) %>% head(10) 
## # A tibble: 10 × 2
##    flight speed
##     <int> <dbl>
##  1   1499  703.
##  2   4667  650.
##  3   4292  648 
##  4   3805  641.
##  5   1902  591.
##  6    315  564 
##  7    707  557.
##  8    936  556.
##  9    347  554.
## 10   1503  554.
fast_flights <- flights %>%
  filter(!is.na(air_time))%>%
  group_by(dest) %>%
  mutate(
    shortest_air_time = min(air_time),
    relative_air_time = air_time / shortest_air_time
  )%>%
    arrange (desc(relative_air_time))

fast_flights
## # A tibble: 327,346 × 21
## # Groups:   dest [104]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     6    24     1932           1920        12     2228           2047
##  2  2013     6    17     1652           1700        -8     1856           1815
##  3  2013     7    23     1605           1400       125     1831           1511
##  4  2013     7    23     1617           1605        12     1833           1740
##  5  2013     7    23     1242           1237         5     1437           1351
##  6  2013     7    23     1200           1200         0     1428           1317
##  7  2013     2    17      841            840         1     1044           1003
##  8  2013     7    23     1242           1245        -3     1429           1355
##  9  2013     6    10     1356           1300        56     1646           1414
## 10  2013     6    29      755            800        -5     1035            909
## # … with 327,336 more rows, and 13 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   shortest_air_time <dbl>, relative_air_time <dbl>

7. Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.

flights %>% 
  group_by(dest, carrier) %>%
  select(dest, carrier) %>%
  ungroup() %>% 
  distinct(dest, carrier) %>% # unique destination and carrier combo 
  count(dest) %>% arrange(n) # show destinations with only one carrier 
## # A tibble: 105 × 2
##    dest      n
##    <chr> <int>
##  1 ABQ       1
##  2 ACK       1
##  3 ALB       1
##  4 ANC       1
##  5 BHM       1
##  6 BUR       1
##  7 BZN       1
##  8 CAK       1
##  9 CHO       1
## 10 CRW       1
## # … with 95 more rows
# same tidy chunk as above, but now keeping only destinations served by more than one carrier 
  
multiple_carriers <- flights %>% 
  group_by(dest, carrier) %>%
  select(dest, carrier) %>%
  ungroup() %>% 
  distinct(dest, carrier) %>% 
  count(dest) %>% filter(!n == 1) 

multiple_carriers %>% 
  arrange(desc(n)) # shows destinations with greatest number of carriers 
## # A tibble: 76 × 2
##    dest      n
##    <chr> <int>
##  1 ATL       7
##  2 BOS       7
##  3 CLT       7
##  4 ORD       7
##  5 TPA       7
##  6 AUS       6
##  7 DCA       6
##  8 DTW       6
##  9 IAD       6
## 10 MSP       6
## # … with 66 more rows

8. For each plane, count the number of flights before the first delay of greater than 1 hour.

flights %>%
  select(tailnum, year, month,day, dep_delay) %>%
  filter(!is.na(dep_delay)) %>%
  arrange(tailnum, year, month, day) %>%
  group_by(tailnum) %>%
  mutate(cumulative_min = cumsum(dep_delay)) %>% # using a cumulative function to sequentially add the dep_delay minutes 
  filter(cumulative_min < 60) %>%
  count(tailnum) %>% arrange(desc(n)) %>% head(15) # finally, show each tail number with its total number of flights 'n' before reaching 60 cumulative delay minutes; head() limits output for the report
## # A tibble: 15 × 2
## # Groups:   tailnum [15]
##    tailnum     n
##    <chr>   <int>
##  1 N957UW    286
##  2 N952UW    282
##  3 N954UW    237
##  4 N961UW    226
##  5 N956UW    217
##  6 N747UW    171
##  7 N951UW    171
##  8 N947UW    162
##  9 N948UW    142
## 10 N958UW    130
## 11 N730MQ    120
## 12 N766US    117
## 13 N953UW    117
## 14 N705TW    111
## 15 N945UW    105
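
A caveat on the chunk above: cumsum(dep_delay) accumulates total delay minutes, so the count is of flights before the cumulative delay reaches 60 minutes. If the exercise is read as counting flights before the first single delay of more than an hour, a sketch using a cumulative condition instead:

```r
library(dplyr)
library(nycflights13)

flights %>%
  filter(!is.na(dep_delay)) %>%
  arrange(tailnum, year, month, day) %>%
  group_by(tailnum) %>%
  mutate(big_delays_so_far = cumsum(dep_delay > 60)) %>%
  filter(big_delays_so_far == 0) %>%  # rows before the first > 60 minute delay
  count(tailnum, sort = TRUE)
```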

Bulut & Desjardins (2019) Chapter 4

Wrangling big data with data.table

This chapter demonstrates the data.table package in R as an alternative to tidyverse and base R methods of data wrangling. The authors suggest that data.table is particularly useful for working with large volumes of data (e.g., 10-100 GB).

library(data.table) 
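
As a minimal sketch of the syntax on a toy table, data.table's general form is DT[i, j, by]: filter rows with i, compute in j, and group with by:

```r
library(data.table)

dt <- data.table(grp = c("a", "a", "b", "b"),
                 val = c(1, 2, 3, 4))

dt[val > 1]                          # i: keep rows where val > 1
dt[, .(total = sum(val)), by = grp]  # j + by: sum val within each grp
```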

4.2.1 Exercises

4.3.1 Exercises

1. Subset all the Female students (ST004D01T) in Germany.

2. How many female students are there in Germany?

3. The .N function returns the length of a vector/number of rows. Use chaining with the .N function to answer Exercise 2.

Using data.table syntax to subset the data, we found that there are 3,197 female students from Germany in the data set. An alternative method using the .N function to return the row count is also shown below.

# explore data; first 5 and last 5 rows returned when printing a data.table object 
#random6 # don't run for now due to loading speed 

# subset all female students from the field ST004D01T 

#random6[CNT == "Germany" & ST004D01T == "Female"]

# return row count with .N function and chaining 
#random6[CNT == "Germany" & ST004D01T == "Female", .N] # don't print code as the output is very large in the markdown report 

4.4.1 Exercises

1. The computer and software variables that were created above ask a student whether they had a computer in their home that they can use for school work (computer) and whether they had educational software in their home (software). Find the proportion of students in Germany and Uruguay that have a computer in their home or have educational software.

# first create custom function to convert bin to numeric  

bin.to.num <- function(x){
  if (is.na(x)) NA
  else if (x == "Yes") 1L
  else if (x == "No") 0L
}
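
Two cautions about bin.to.num(): it handles one value at a time (hence the sapply() calls below), and any response other than NA, "Yes", or "No" falls through every branch and returns NULL, which would make sapply() return a list. A vectorized base R alternative (assuming any non-"Yes", non-NA response should map to 0):

```r
# NA stays NA, "Yes" -> 1L, anything else -> 0L
bin.to.num2 <- function(x) as.integer(x == "Yes")

bin.to.num2(c("Yes", "No", NA))  # -> 1 0 NA
```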


# using the custom function, create new variables, computer and software, that we can use to calculate the proportion of students with access to these resources  


random6[, `:=` 
     (female = ifelse(ST004D01T == "Female", 1, 0),
       sex = ST004D01T,
       
       # At my house we have ...
       desk = sapply(ST011Q01TA, bin.to.num),
       own.room = sapply(ST011Q02TA, bin.to.num),
       quiet.study = sapply(ST011Q03TA, bin.to.num),
       computer = sapply(ST011Q04TA, bin.to.num),
       software = sapply(ST011Q05TA, bin.to.num),
       internet = sapply(ST011Q06TA, bin.to.num),
       lit = sapply(ST011Q07TA, bin.to.num),
       poetry = sapply(ST011Q08TA, bin.to.num),
       art = sapply(ST011Q09TA, bin.to.num),
       book.sch = sapply(ST011Q10TA, bin.to.num),
       tech.book = sapply(ST011Q11TA, bin.to.num),
       dict = sapply(ST011Q12TA, bin.to.num),
       art.book = sapply(ST011Q16NA, bin.to.num))]

In Germany, 95% of students have a computer in their home and 45% of students have educational software.

random6[CNTRYID == "Germany", table(computer)] # 247 No's and 5438 Yes' 
## computer
##    0    1 
##  247 5438
# computer prop 
5438 /(5438 + 247)*100 #  95% of students in Germany have a computer in their home
## [1] 95.65523
random6[CNTRYID == "Germany", table(software)] # 3038 No's and 2481 Yes' 
## software
##    0    1 
## 3038 2481
# educational software prop 
2481/(2481 + 3038) # 45% have access to educational software 
## [1] 0.449538

In Uruguay, 88% of students have a computer in the home and 43% of students have educational software.

random6[CNTRYID == "Uruguay", table(computer)] # 635 No's and 5099 Yes' 
## computer
##    0    1 
##  635 5099
# computer in home prop 
5099/(5099 + 635) # 88% have a computer in their home 
## [1] 0.8892571
random6[CNTRYID == "Uruguay", table(software)] # 3013 No's and 2313 Yes' 
## software
##    0    1 
## 3013 2313
# educational software prop 
2313/(2313 + 3013) # 43% have access to educational software 
## [1] 0.4342846

2. For just female students, find the proportion of students who have their own room (own.room) or a quiet place to study (quiet.study).

Among female students in the random6 data set, 72% stated that they have their own room and 85% stated that they have a quiet place to study.

random6[female == 1, table(own.room)] # 4805 No's and 12644 Yes' 
## own.room
##     0     1 
##  4805 12644
# prop own.room

12644 / (12644 + 4805) #72% 
## [1] 0.7246261
random6[female == 1, table(quiet.study)] #2606 No's and 14801 Yes' 
## quiet.study
##     0     1 
##  2606 14801
# prop quiet.study 

14801 / (14801 + 2606) # 85% 
## [1] 0.8502901
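
Since these indicator columns are coded 0/1, the proportions can also be computed directly with mean(), skipping the manual division. A sketch on a toy stand-in (the same random6[female == 1, .(mean(own.room, na.rm = TRUE), ...)] pattern would apply to the real data):

```r
library(data.table)

# toy stand-in with the same 0/1 coding as random6's indicator columns
dt <- data.table(female      = c(1, 1, 1, 0),
                 own.room    = c(1, 0, 1, 1),
                 quiet.study = c(1, 1, NA, 0))

# the mean of a 0/1 indicator is the proportion of 1s
dt[female == 1, .(prop_own_room    = mean(own.room, na.rm = TRUE),
                  prop_quiet_study = mean(quiet.study, na.rm = TRUE))]
```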

4.5.1 Exercises

1. Calculate the proportion of students who have art in their home (art) and the average age (AGE) of the students by gender.

About half of students have art in their home (51% of female students, 50% of male students). The average age of both male and female students is 15.8 years.

# proportion of students who have art in their homes: 
random6[, .(art = mean(art, na.rm = TRUE), AGE = mean(AGE, na.rm = TRUE)), by = .(sex)] 
##       sex       art      AGE
## 1: Female 0.5061029 15.81118
## 2:   Male 0.4981741 15.80671

2. Within a by argument you can discretize a variable to create a grouping variable. Perform a median split for age within the by argument and assess whether there are age differences associated with having your own room (own.room) or a desk (desk).

# create custom function to split a column of data by above or below median 
median_cut = function(x) { 
  ifelse(x > median(x), "above_median", "below_median")
}


random6[,
     .(own.room = mean(own.room, na.rm = TRUE), 
       desk_mean = mean(desk, na.rm = TRUE)), 
     by = .(median_cut(AGE)) ] 
##      median_cut  own.room desk_mean
## 1: above_median 0.7439972 0.8684374
## 2: below_median 0.7441766 0.8680203

The authors also show that the melt() function can be used to reshape data sets from wide to long format; this function is applied in the lab exercises that follow.

4.8 Lab

1. This afternoon when we discuss supervised learning, we’ll ask you to develop some models to predict the response to the question “Do you expect your child will go into a science-related career?” (PA032Q03TA). Recode this variable so that a “Yes” is 1 and a “No” is -1 and save the variable as sci_car.

# recode and store as 'sci_car'; use table to confirm re-coding
random6[, 
        "sci_car" := sapply(PA032Q03TA, 
                            function(x) { 
                              if (is.na(x)) NA
                              else if (x == "No") -1L 
                              else if (x == "Yes") 1L 
                              
                              }) ][,
                                    table(sci_car)]
## sci_car
##   -1    1 
## 5104 4946

2. Calculate descriptives for this variable by sex and country. Specifically, the proportion of test takers whose parents said “Yes” or 1.

Because we are using the smaller random6 data set, we return results for Germany and Mexico specifically, since the other countries are not in the data set.

random6[CNTRYID %in% c("Germany", "Mexico"), 
        .(mean(sci_car, na.rm = TRUE)), 
        by = .(sex, CNTRYID)]
##       sex CNTRYID         V1
## 1: Female Germany -0.8348018
## 2:   Male Germany -0.7085292
## 3:   Male  Mexico  0.3975610
## 4: Female  Mexico  0.3200577

3. Means and standard deviations (sd) for the variables that you think will be most predictive of sci_car.

# means and sd of variables hypothesized to predict sci_car
random6[, 
        .(environ_avg =  mean(ENVAWARE, na.rm = TRUE), 
          environ_sd = sd(ENVAWARE, na.rm = TRUE), 
          joy_avg = mean(JOYSCIE, na.rm = TRUE), 
          joy_sd = sd(JOYSCIE, na.rm = TRUE), 
          self_eff_avg = mean(SCIEEFF, na.rm = TRUE), 
          self_eff_sd = sd(SCIEEFF, na.rm = TRUE)
          )]
##    environ_avg environ_sd    joy_avg   joy_sd self_eff_avg self_eff_sd
## 1:  -0.1688647   1.097327 0.06985556 1.117806 -0.008148365    1.204854

4. Calculate these same descriptives by groups (by sci_car and by sex).

# means and sd of focal variables, grouped by sci_car and sex 
random6[, 
        .(environ_avg =  mean(ENVAWARE, na.rm = TRUE), 
          environ_sd = sd(ENVAWARE, na.rm = TRUE), 
          joy_avg = mean(JOYSCIE, na.rm = TRUE), 
          joy_sd = sd(JOYSCIE, na.rm = TRUE), 
          self_eff_avg = mean(SCIEEFF, na.rm = TRUE), 
          self_eff_sd = sd(SCIEEFF, na.rm = TRUE)
          ), by = .(sci_car, sex)]
##    sci_car    sex environ_avg environ_sd     joy_avg    joy_sd self_eff_avg
## 1:      NA Female -0.31842610  0.9874307 -0.09216258 1.1173348 -0.171922907
## 2:      -1 Female -0.01864386  0.9728241 -0.11241710 1.0385336  0.016697673
## 3:       1   Male  0.18221720  1.1622236  0.55847968 1.0094256  0.361605273
## 4:      -1   Male  0.06017463  1.1645145  0.12967621 1.0792426  0.162395292
## 5:      NA   Male -0.20814244  1.1916179  0.06879025 1.1356331 -0.009655823
## 6:       1 Female  0.08936607  0.9659264  0.55664182 0.9397745  0.311395444
##    self_eff_sd
## 1:    1.186161
## 2:    1.088183
## 3:    1.107026
## 4:    1.141781
## 5:    1.267313
## 6:    1.059443

5. Calculate correlations between these variables and sci_car

# first store smaller data set containing observations for all focal variables and sci_car 
sci_car_variables <- random6[, 
        .(sci_car, ENVAWARE, JOYSCIE, SCIEEFF)]

sci_car_variables[, 
                   .(env_joy_cor = cor(x = ENVAWARE, y = JOYSCIE, use = "complete.obs"),
                   joy_self_eff_cor = cor(x = JOYSCIE, y = SCIEEFF, use = "complete.obs"), 
                   env_sci_car_cor = cor(x = ENVAWARE, y = sci_car, use = "complete.obs"), 
                   sci_car_joy_cor = cor(x = JOYSCIE, y = sci_car, use = "complete.obs"))]
##    env_joy_cor joy_self_eff_cor env_sci_car_cor sci_car_joy_cor
## 1:   0.3650092        0.3482787      0.05626268       0.2666819

6. Create new variables: Discretize the math and reading variables using the OECD means (490 for math and 493 for reading) and code them as 1 (at or above the mean) and -1 (below the mean), but do it in the data.table way without using the $ operator.

# discretize math and reading scores, stored as 'math_split' and 'reading_split'; use table to confirm re-coding
random6[, 
        "math_split" := sapply(PV1MATH, 
                            function(x) { 
                              if (is.na(x)) NA
                              else if (x <= 490) -1L 
                              else if (x > 490) 1L 
                              
                              })][,
                                    table(math_split)]
## math_split
##    -1     1 
## 21578 14269
random6[, 
               "reading_split" := sapply(PV1READ, 
                            function(x) { 
                              if (is.na(x)) NA
                              else if (x <= 493) -1L 
                              else if (x > 493) 1L}) ][, 
                                                     table(reading_split)]
## reading_split
##    -1     1 
## 21191 14656
# calculate correlations with these new variables and those above 
# store as table for subsequent formatting 
math_reading_cor <- random6[, 
                   .(env_joy_cor = cor(x = ENVAWARE, y = JOYSCIE, use = "complete.obs"),
                   joy_self_eff_cor = cor(x = JOYSCIE, y = SCIEEFF, use = "complete.obs"), 
                   env_sci_car_cor = cor(x = ENVAWARE, y = sci_car, use = "complete.obs"), 
                   sci_car_joy_cor = cor(x = JOYSCIE, y = sci_car, use = "complete.obs"), 
                   math_joy_cor = cor(x = math_split, y = sci_car, use = "complete.obs"), 
                   reading_joy_cor = cor(x = reading_split, y = sci_car, use = "complete.obs"), 
                   math_env_cor = cor(x = ENVAWARE, y = math_split, use = "complete.obs"), 
                   reading_env_cor = cor(x = JOYSCIE, y = reading_split, use = "complete.obs"), 
                   math_self_eff_cor = cor(x = SCIEEFF, y = math_split, use = "complete.obs"), 
                  reading_self_eff_cor = cor(x = SCIEEFF, y = reading_split, use = "complete.obs"))] 

# pivot long using melt(); easier to view correlations this way 
melt(math_reading_cor) 
##                 variable       value
##  1:          env_joy_cor  0.36500924
##  2:     joy_self_eff_cor  0.34827868
##  3:      env_sci_car_cor  0.05626268
##  4:      sci_car_joy_cor  0.26668187
##  5:         math_joy_cor -0.21986258
##  6:      reading_joy_cor -0.20404487
##  7:         math_env_cor  0.12821305
##  8:      reading_env_cor  0.05982742
##  9:    math_self_eff_cor  0.02318556
## 10: reading_self_eff_cor  0.04272108

7. Chain together a set of operations: For example, create an intermediate variable that is the average of JOYSCIE and INTBRSCI, and then calculate the mean by country by sci_car through chaining.

random6[, .(Mean = rowMeans(.SD)), by = CNTRYID, .SDcols = c("JOYSCIE", "INTBRSCI")][,,by = sci_car]
## Warning in `[.data.table`(random6[, .(Mean = rowMeans(.SD)), by = CNTRYID, :
## Ignoring by= because j= is not supplied
## Warning in `[.data.table`(random6[, .(Mean = rowMeans(.SD)), by = CNTRYID, :
## i and j are both missing so ignoring the other arguments. This warning will be
## upgraded to error in future.
##        CNTRYID     Mean
##     1: Germany -0.68550
##     2: Germany -1.62215
##     3: Germany -0.97025
##     4: Germany  1.03440
##     5: Germany  0.06700
##    ---                 
## 35843: Uruguay  1.16175
## 35844: Uruguay  0.22755
## 35845: Uruguay -1.29245
## 35846: Uruguay -0.31095
## 35847: Uruguay -0.54705
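
The warnings above arise because the first [...] returns only Mean (dropping sci_car), and the second [...] supplies by= without j=. A chain that creates the row-wise average first and then aggregates by both grouping columns would avoid this; a sketch on a toy stand-in with the same column names as random6:

```r
library(data.table)

# toy stand-in mirroring the relevant random6 columns
dt <- data.table(CNTRYID  = c("Germany", "Germany", "Uruguay"),
                 sci_car  = c(1, 1, -1),
                 JOYSCIE  = c(1, 3, 5),
                 INTBRSCI = c(3, 5, 7))

# create the row-wise average, then aggregate it by country and sci_car
dt[, joy_int_avg := rowMeans(.SD, na.rm = TRUE),
   .SDcols = c("JOYSCIE", "INTBRSCI")][
   , .(mean_joy_int = mean(joy_int_avg, na.rm = TRUE)),
   by = .(CNTRYID, sci_car)]
```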

8. Transform variables, specifically recode MISCED and FISCED from characters to numeric variables.

# create column subset to convert to numeric
cols_to_convert <- c("MISCED", "FISCED") 

# convert to numeric with subset index and using lapply() 
random6[ , 
           (cols_to_convert) := lapply(.SD, as.numeric),
           .SDcols = c("MISCED", "FISCED")]
## Warning in lapply(.SD, as.numeric): NAs introduced by coercion

## Warning in lapply(.SD, as.numeric): NAs introduced by coercion

9. Examine other variables in the pisa data set that you think might be predictive of PA032Q03TA.

library(psych) 
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
random6[, lapply(.SD, mean), .SDcols = c("DISCLISCI", "TEACHSUP", "IBTEACH", "TDTEACH")]
##    DISCLISCI TEACHSUP IBTEACH TDTEACH
## 1:        NA       NA      NA      NA
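
The NA row above is because mean() defaults to na.rm = FALSE; extra arguments to lapply() are forwarded to the applied function, so passing na.rm = TRUE fixes it. A toy sketch (the same random6[, lapply(.SD, mean, na.rm = TRUE), .SDcols = ...] call would apply to the real data):

```r
library(data.table)

dt <- data.table(a = c(1, NA, 3),
                 b = c(2, 4, NA))

# lapply() forwards na.rm = TRUE to each mean() call
dt[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("a", "b")]
```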
# run descriptive statistics for three additional variables hypothesized to relate to PA032Q03TA 
descriptive_stats <- random6[ ,lapply(.SD, psych::describe), 
         .SDcols = c("DISCLISCI", "TEACHSUP","TDTEACH")] 

# reshape table into long format so that it is easier to read the output
melt(descriptive_stats) 
## Warning in melt.data.table(descriptive_stats): id.vars and measure.vars are
## internally guessed when both are 'NULL'. All non-numeric/integer/logical type
## columns are considered id.vars, which in this case are columns []. Consider
## providing at least one of 'id' or 'measure' vars in future.
##               variable         value
##  1:     DISCLISCI.vars  1.000000e+00
##  2:        DISCLISCI.n  3.140600e+04
##  3:     DISCLISCI.mean  1.445539e-01
##  4:       DISCLISCI.sd  1.002982e+00
##  5:   DISCLISCI.median  3.900000e-03
##  6:  DISCLISCI.trimmed  1.504836e-01
##  7:      DISCLISCI.mad  1.032186e+00
##  8:      DISCLISCI.min -2.416200e+00
##  9:      DISCLISCI.max  1.883700e+00
## 10:    DISCLISCI.range  4.299900e+00
## 11:     DISCLISCI.skew -1.627706e-01
## 12: DISCLISCI.kurtosis -2.219641e-01
## 13:       DISCLISCI.se  5.659616e-03
## 14:      TEACHSUP.vars  1.000000e+00
## 15:         TEACHSUP.n  3.062000e+04
## 16:      TEACHSUP.mean  1.259763e-01
## 17:        TEACHSUP.sd  9.746136e-01
## 18:    TEACHSUP.median  4.540000e-02
## 19:   TEACHSUP.trimmed  1.846121e-01
## 20:       TEACHSUP.mad  9.899320e-01
## 21:       TEACHSUP.min -2.719500e+00
## 22:       TEACHSUP.max  1.447500e+00
## 23:     TEACHSUP.range  4.167000e+00
## 24:      TEACHSUP.skew -3.721514e-01
## 25:  TEACHSUP.kurtosis -1.845543e-01
## 26:        TEACHSUP.se  5.569675e-03
## 27:       TDTEACH.vars  1.000000e+00
## 28:          TDTEACH.n  3.012000e+04
## 29:       TDTEACH.mean -2.597529e-02
## 30:         TDTEACH.sd  9.603662e-01
## 31:     TDTEACH.median -1.170000e-02
## 32:    TDTEACH.trimmed -3.303284e-02
## 33:        TDTEACH.mad  8.873361e-01
## 34:        TDTEACH.min -2.447600e+00
## 35:        TDTEACH.max  2.078100e+00
## 36:      TDTEACH.range  4.525700e+00
## 37:       TDTEACH.skew -7.223483e-03
## 38:   TDTEACH.kurtosis  3.790284e-01
## 39:         TDTEACH.se  5.533621e-03
##               variable         value